Abstract: This paper proposes a multimodal biometric system
based on palmprint and speech signals, with a novel approach
for each modality. Features are extracted using Subband
Cepstral Coefficients for the speech signal and the Modified
Canonical method for the palmprint. The individual feature
scores are passed to the fusion level, where they are combined
by a new fusion method called weighted score. The system is
tested on clean and degraded databases collected by the
authors from more than 300 subjects. The results show a
significant improvement in the recognition rate.

Index Terms: Multimodal biometrics, speech signal,
palmprint, fusion

I. Introduction 

Unimodal biometric authentication identifies an individual
using a physiological or behavioral characteristic such as
palmprint, face, fingerprint, hand geometry, iris, retina,
vein pattern or speech. These methods are more reliable than
knowledge-based (e.g., password) or token-based (e.g., key)
techniques, because biometric features are difficult to steal
and cannot be forgotten.

However, a single biometric feature is sometimes not accurate
enough to verify the identity of a person. Combining multiple
modalities can improve performance and reliability. Owing to
its promising applications as well as its theoretical
challenges, multimodal biometrics has drawn increasing
attention in recent years [1]. Speech and palmprint multimodal
biometrics are advantageous because speech and image
acquisition is non-invasive and low-cost: palmprint images can
be acquired with digital cameras or touchless sensors, and
speech signals with a microphone. Existing studies of this
approach [2, 3] employ holistic features to represent the
palmprint and the speech signal, and report results for
different fusion techniques and algorithms.

A multimodal system also provides an anti-spoofing measure
by making it difficult for an intruder to spoof multiple
biometric traits simultaneously. However, an integration
scheme is required to fuse the information presented by the
individual modalities.

This paper presents a novel fusion strategy for personal
identification using speech signal and palmprint features,
fused at the matching-score level. It shows that integrating
speech and palmprint biometrics can achieve a performance
that may not be possible with a single biometric indicator
alone. We extract the features using the modified canonical
form method for the palmprint and Subband Cepstral
Coefficients for speech. Integrating the two traits at the
fusion level gives better performance and better accuracy
than either trait (speech signal or palmprint) alone.

©2012 ACEEE
DOL01.USIR03.01.7

The rest of this paper is organized as follows. Section 2
presents the system structure, in which multiple classifiers
are combined using matching scores to improve on the
performance of the individual biometric traits. Section 3
presents the feature extraction method used for the speech
signal and Section 4 the method used for the palmprint. In
Section 5, the individual traits are fused at the
matching-score level using the weighted sum of scores
technique. Finally, the experimental results are given in
Section 6, and conclusions are given in the last section.

II. System Overview 

The block diagram of a multimodal biometric system that uses
two modalities (palmprint and speech) for human recognition
is shown in Figure 1. It consists of three main blocks:
preprocessing, feature extraction and fusion.
Preprocessing and feature extraction are performed in parallel
for the two modalities. The preprocessing of the audio signal
under noisy conditions includes signal enhancement, tracking
of environment and channel noise, feature estimation and
smoothing [4]. The preprocessing of the palmprint typically
consists of the challenging problems of detecting and
tracking the palm and the important palm features.

Further, features are extracted from the training and testing 
images and speech signal respectively, and then matched to 
find the similarity between two feature sets. The matching 
scores generated from the individual recognizers are passed 
to the decision module where a person is declared as genuine 
or an imposter. 

III. Subband Based Cepstral Coefficients And Gaussian 
Mixture Model 

A. Subband Decomposition via Wavelet Packets 

A detailed discussion of wavelet analysis is beyond the
scope of this paper; interested readers are referred to the
more complete discussion in [5]. In continuous time, the
wavelet transform is defined as the inner product of a signal
x(t) with a collection of wavelet functions $\psi_{a,b}(t)$,
which are scaled (by a) and translated (by b) versions of the
prototype wavelet $\psi(t)$.

ACEEE Int. J. on Signal & Image Processing, Vol. 03, No. 01, Jan 2012

Figure 1. Block diagram of the proposed multimodal biometric verification system

Discrete-time implementations of wavelets and wavelet packets
are based on the iteration of two-channel filter banks which
are subject to certain constraints, such as low-pass and/or
high-pass branches on each level followed by a
subsampling-by-two unit. Unlike the wavelet transform, which
is obtained by iterating on the low-pass branch only, the
filterbank tree can be iterated on either branch at any level,
resulting in a tree-structured filterbank which we call a
wavelet packet filterbank tree. The resultant transform
creates a division of the frequency domain that represents the
signal optimally with respect to the applied metric while
allowing perfect reconstruction of the original signal.
Because the analysis is carried out in the frequency domain,
it is also called subband decomposition, where the subbands
are determined by the wavelet packet filterbank tree.
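
The two-channel iteration described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the Haar filter pair (the paper does not name its wavelet), it splits every node down to a fixed depth rather than following the non-uniform tree of Figure 2, and the function names are hypothetical.

```python
import math

def haar_split(signal):
    """One two-channel stage: Haar low/high-pass filtering, then subsampling by two.

    Assumes an even-length input; the Haar filters are an illustrative choice.
    """
    s = 1.0 / math.sqrt(2.0)
    low = [s * (signal[k] + signal[k + 1]) for k in range(0, len(signal) - 1, 2)]
    high = [s * (signal[k] - signal[k + 1]) for k in range(0, len(signal) - 1, 2)]
    return low, high

def haar_merge(low, high):
    """Inverse of haar_split: rebuilds the even-length input exactly."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for l, h in zip(low, high):
        out.extend([s * (l + h), s * (l - h)])
    return out

def wavelet_packet(signal, depth):
    """Iterate the filter bank on BOTH branches for `depth` levels.

    Returns the 2**depth leaf subband signals in natural (filter-path) order.
    The input length should be divisible by 2**depth.
    """
    nodes = [list(signal)]
    for _ in range(depth):
        next_nodes = []
        for node in nodes:
            low, high = haar_split(node)
            next_nodes.extend([low, high])
        nodes = next_nodes
    return nodes
```

Because the Haar pair is orthonormal, the total energy of the leaves equals the energy of the input, and `haar_merge` inverts `haar_split`, which is the perfect reconstruction property mentioned above. The leaves come out in filter-path order, not in order of increasing frequency.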

B. Wavelet Packet Transform Based Feature Extraction 
Procedure 

Here, speech is assumed to be sampled at 8 kHz. A frame 
size of 24msec with a 10msec skip rate is used to derive the 
Subband based Cepstral Coefficients features, whereas a 
20msec frame with the same skip rate is used to derive the 
MFCCs. We have used the same configuration proposed in 
[6] for MFCC. Next, the speech frame is Hamming windowed 
and pre-emphasized. 
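
With these settings a frame holds 8000 x 0.024 = 192 samples and consecutive frames start 8000 x 0.010 = 80 samples apart. The sketch below shows one way to produce such frames. The pre-emphasis coefficient 0.97, the choice to pre-emphasize the whole signal before windowing, and all function names are assumptions made for illustration; none of them is specified in the text.

```python
import math

SAMPLE_RATE_HZ = 8000
FRAME_MS = 24
SKIP_MS = 10

def pre_emphasize(signal, coeff=0.97):
    """First-order filter y[n] = x[n] - coeff * x[n-1]; coeff is an assumed value."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - coeff * signal[n - 1]
                          for n in range(1, len(signal))]

def hamming(length):
    """Hamming window of the given length (length >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]

def frames(signal, sample_rate=SAMPLE_RATE_HZ, frame_ms=FRAME_MS, skip_ms=SKIP_MS):
    """Cut the signal into overlapping, Hamming-windowed frames."""
    frame_len = sample_rate * frame_ms // 1000   # 192 samples at 8 kHz
    skip = sample_rate * skip_ms // 1000         # 80 samples at 8 kHz
    window = hamming(frame_len)
    out = []
    for start in range(0, len(signal) - frame_len + 1, skip):
        chunk = signal[start:start + frame_len]
        out.append([w * v for w, v in zip(window, chunk)])
    return out
```

A trailing partial frame is dropped here rather than zero-padded, which is another design choice the source leaves open.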

The proposed tree assigns more subbands between low
to mid frequencies while keeping roughly a log-like
distribution of the subbands across frequency. The wavelet
packet transform is computed for the given wavelet tree,
which results in a sequence of subband signals or,
equivalently, the wavelet packet transform coefficients at the
leaves of the tree. In effect, each of these subband signals
contains only restricted frequency information due to the
inherent bandpass filtering. The wavelet packet tree is given
in Figure 2. The energy of the sub-signals for each subband is
computed and then scaled by the number of transform
coefficients in that subband.

Figure 2. Wavelet Packet Tree

The subband signal energies are computed for each frame
as

$$S_i = \frac{\sum_{m}\left[(W_\psi x)(i, m)\right]^2}{N_i}, \qquad i = 1, 2, \ldots, L \qquad (3)$$

where
$W_\psi x$ : wavelet packet transform of signal x,
i : subband frequency index (i = 1, 2, ..., L),
$N_i$ : number of coefficients in the i-th subband.
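
The energy computation of Eq. (3) amounts to a mean of squared coefficients per subband. A minimal sketch follows, assuming the leaf subbands of one frame are already available as lists of coefficients; the function name is hypothetical.

```python
def subband_energies(subbands):
    """S_i = (sum of squared coefficients in subband i) / N_i, for each subband."""
    energies = []
    for coeffs in subbands:
        if not coeffs:
            raise ValueError("each subband needs at least one coefficient")
        energies.append(sum(c * c for c in coeffs) / len(coeffs))
    return energies
```

Dividing by the number of coefficients keeps subbands of different lengths comparable, which matters for a non-uniform tree where the leaves sit at different depths.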

C. Subband based Cepstral Coefficients 

As with MFCCs, the derivation of the coefficients is performed
in two stages. The first stage is the computation of the
filterbank energies, and the second stage is the decorrelation
of the log filterbank energies with a DCT to obtain the MFCCs.
The derivation of the Subband Based Cepstral coefficients
follows the same process, except that the filterbank energies
are derived using the wavelet packet transform rather than
the short-time Fourier transform. It will be shown that these
features outperform MFCCs. We attribute this to the
computation of subband signals with smooth filters. The effect
of the filtering that results from tracing through the
low-pass/high-pass branches of the wavelet packet tree is much
smoother, due to the balance in time-frequency representation.
We believe that this will contribute to improved
speech/speaker characterization over MFCCs. Subband Based
Cepstral coefficients are derived from the subband energies by
applying the Discrete Cosine Transform:

$$SBC(n) = \sum_{i=1}^{L} \log S_i \, \cos\!\left(\frac{n\,(i - 0.5)}{L}\,\pi\right), \qquad n = 1, \ldots, n' \qquad (4)$$

where n' is the number of SBC coefficients and L is the total
number of frequency bands. Because of the similarity to
root-cepstral [7] analysis, they are termed subband based
cepstral coefficients.

Figure 3. Block diagram for Wavelet Packet Transform based
feature extraction procedure
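
Equation (4) can be evaluated directly from the subband energies of a frame. The sketch below is a straightforward transcription, assuming strictly positive energies (the logarithm is undefined otherwise); the function and argument names are hypothetical.

```python
import math

def sbc(subband_energies, num_coeffs):
    """SBC(n) = sum_i log(S_i) * cos(n * (i - 0.5) * pi / L), n = 1..num_coeffs."""
    L = len(subband_energies)
    log_e = [math.log(s) for s in subband_energies]  # raises ValueError if s <= 0
    return [
        sum(log_e[i - 1] * math.cos(n * (i - 0.5) * math.pi / L)
            for i in range(1, L + 1))
        for n in range(1, num_coeffs + 1)
    ]
```

A flat spectrum (all energies equal) yields coefficients that are zero for n = 1, ..., L-1, since the cosine terms cancel; only departures from flatness show up in the features.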

D. The Gaussian Mixture Model 

In this study, a Gaussian Mixture Model approach 
proposed in [8] is used where speakers are modeled as a 
mixture of Gaussian densities. The use of this model is 
motivated by the interpretation that the Gaussian 
components represent some general speaker-dependent 
spectral shapes and the capability of Gaussian mixtures to 
model arbitrary densities. 

The Gaussian Mixture Model is a linear combination of
M Gaussian component densities, given by the equation

$$p(x) = \sum_{i=1}^{M} p_i \, b_i(x) \qquad (5)$$

where x is a D-dimensional random vector, $b_i(x)$, i = 1, ..., M
are the component densities and $p_i$, i = 1, ..., M are the
mixture weights. Each component density is a D-dimensional
Gaussian function of the form

$$b_i(x) = \frac{1}{(2\pi)^{D/2}\,|\Sigma_i|^{1/2}} \exp\!\left\{-\frac{1}{2}(x - \mu_i)^{T}\,\Sigma_i^{-1}\,(x - \mu_i)\right\} \qquad (6)$$

where $\mu_i$ denotes the mean vector and $\Sigma_i$ denotes the
covariance matrix. The mixture weights satisfy the law of total
probability, $\sum_{i=1}^{M} p_i = 1$. The major advantage of this
representation of speaker models is its mathematical
tractability: the complete Gaussian mixture density is
represented by only the mean vectors, covariance matrices
and mixture weights of all component densities.
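
To make Eqs. (5) and (6) concrete, the sketch below evaluates the log of the mixture density for one feature vector. It assumes diagonal covariance matrices, a common simplification that the paper does not state it uses, and works in the log domain for numerical stability. The function names are hypothetical, and model training (estimating weights, means and variances) is not shown.

```python
import math

def log_gaussian_diag(x, mean, var):
    """log b_i(x) for a Gaussian whose covariance is diagonal (variances in `var`)."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + maha)

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x) = log sum_i p_i * b_i(x), using the log-sum-exp trick."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))
```

In a speaker verification setting the per-frame log-likelihoods would typically be summed over all frames of an utterance to score it against a speaker model.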

IV. Feature Extraction Using Modified Canonical Form 
Method 

Features are the attributes or values extracted to get the 
unique characteristics from the image and speech signal. 

A. Palmprint feature extraction methodology 

Details of the algorithm are as follows:
1) Identify hand image from background

The system is designed so that palmprint images are captured
with a contactless setup without pegs, keeping the image
background relatively uniform and of relatively low intensity
compared with the hand. Using the statistical information of
the background, the algorithm estimates an adaptive threshold
to segment the hand from the background. Pixels with intensity
above the threshold are considered to be part of the hand
image.

Figure 5. Segmentation of ROI

2) Locate region-of-interest

The palm area is extracted from the binary image of the
hand. After converting the original image into a binary image,
we find two key positioning points in the palmprint image
using an automatic detection method. The first valley in the
graph is the gap between the little finger and the ring
finger, Key Point 1. The third valley in the graph is the gap
between the middle finger and the index finger, Key Point 2.
The key points are circled in Figure 4. The hand image is
rotated by θ degrees to align it with a predefined direction;
θ is calculated using the key points as shown in Figure 4.
Since the original image is large, a smaller hand image is
cropped from it after the alignment. Figure 5 shows the
proposed image alignment and ROI selection method.
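
The two steps above can be sketched as follows. The paper does not give the threshold rule or the formula for the rotation angle, so both are assumptions here: the threshold is taken as the background mean plus k standard deviations, and the angle is that of the line joining the two key points. The function names, the default k = 3 and the way background pixels are sampled are all hypothetical.

```python
import math
import statistics

def background_threshold(background_pixels, k=3.0):
    """Adaptive threshold from background statistics: mean + k * std (assumed rule)."""
    mean = statistics.fmean(background_pixels)
    std = statistics.pstdev(background_pixels)
    return mean + k * std

def segment_hand(image, threshold):
    """Binary mask: 1 where a pixel is brighter than the threshold, else 0."""
    return [[1 if pixel > threshold else 0 for pixel in row] for row in image]

def alignment_angle_deg(key_point_1, key_point_2):
    """Angle (degrees) of the line through the two key points, relative to the x-axis."""
    (x1, y1), (x2, y2) = key_point_1, key_point_2
    return math.degrees(math.atan2(y2 - y1, x2 - x1))
```

In use, the background samples might come from the image border, the mask would feed the key-point detector, and the image would then be rotated by the returned angle (or its negative, depending on the axis convention) before the ROI is cropped.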

B. Modified Canonical Form Method 

The "Eigenpalm" method, which follows the eigenface approach
of Turk and Pentland [9] [10], is based on the Karhunen-Loeve
expansion, and this work motivates our approach to
representing images efficiently. The method finds the
principal components (Karhunen-Loeve expansion) of the image
distribution, that is, the eigenvectors of the covariance
matrix of the set of images. These eigenvectors can be thought
of as a set of features which together characterize the
variation between images.

Let an image I(x, y) be a two-dimensional array of intensity
values, or equivalently a vector of dimension n. Let the
training set of images be $I_1, I_2, I_3, \ldots, I_N$. The
average image of the set is defined by

$$\bar{I} = \frac{1}{N} \sum_{k=1}^{N} I_k \qquad (7)$$

Each image differs from the average by the vector

$$\phi_k = I_k - \bar{I} \qquad (8)$$

This set of very large vectors is subjected to principal
component analysis, which seeks a set of K orthonormal
vectors $v_k$, k = 1, ..., K and their associated eigenvalues
$\lambda_k$ which best describe the distribution of the data.
The vectors $v_k$ and scalars $\lambda_k$ are the eigenvectors
and eigenvalues of the covariance matrix

$$C = \frac{1}{N} \sum_{k=1}^{N} \phi_k \phi_k^{T} = \frac{1}{N} A A^{T} \qquad (9)$$

where the matrix $A = [\phi_1\; \phi_2\; \ldots\; \phi_N]$.
Finding the eigenvectors of the n x n matrix C is
computationally intensive. However, the eigenvectors of C can
be determined by first finding the eigenvectors of a much
smaller matrix of size N x N and taking a linear combination
of the resulting vectors [4].
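
The reduction to an N x N problem can be made explicit. What follows is the standard argument, sketched under the assumption that A is the n x N matrix whose columns are the mean-subtracted training images and that $C = \frac{1}{N} A A^{T}$; the symbols $u_k$ and $\mu_k$ (eigenvectors and non-zero eigenvalues of the small matrix $A^{T}A$) are introduced here for illustration only.

```latex
A^{T} A \, u_k = \mu_k \, u_k
\;\Longrightarrow\;
A A^{T} (A u_k) = \mu_k \, (A u_k)
\;\Longrightarrow\;
C \, (A u_k) = \frac{\mu_k}{N} \, (A u_k),
\qquad
v_k = \frac{A u_k}{\lVert A u_k \rVert}.
```

Multiplying the first relation on the left by A gives the second, so each eigenvector of the small matrix maps to an eigenvector $A u_k$ of C, which is a linear combination of the mean-subtracted images with the entries of $u_k$ as weights.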

The modified canonical method proposed in this paper is
based on eigenvalues and eigenvectors. These eigenvalues
can be thought of as a set of features which together
characterize the variation between images.

Let P be the normalized modal matrix of I; the diagonal matrix
is given by

$$D = P^{-1} C P \qquad (10)$$



Where 



and 



X, = sqrtC^yk*) , i,j = 1, 2, 3, ..., n    (11)

Then the quadratic form Q is given by

$$Q = X^{T} D X \qquad (12)$$

The following steps are considered for the feature extraction:

• Select the palm image for the input 

• Pre-process the image 

• Determine the eigen values and eigen vectors of 
the image 

• Use the canonical form for the feature extraction. 
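
When D is diagonal, the quadratic form of Eq. (12) reduces to a weighted sum of squares, Q = d_1 X_1^2 + ... + d_n X_n^2. The sketch below evaluates that sum. It assumes the diagonal of D holds the eigenvalues and that X is supplied as a plain list of numbers; the function name and this reading of X are hypothetical, and the eigen-decomposition itself is not shown.

```python
def canonical_form(x, diagonal):
    """Q = X^T D X for a diagonal D, i.e. sum_i d_i * x_i**2."""
    if len(x) != len(diagonal):
        raise ValueError("x and the diagonal of D must have the same length")
    return sum(d * xi * xi for xi, d in zip(x, diagonal))
```

For example, `canonical_form([1, 2], [3, 4])` evaluates 3*1 + 4*4.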

C. Euclidean Distance 

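Let an arbitrary instance x be described by the feature
vector

$$\langle a_1(x), a_2(x), \ldots, a_n(x) \rangle \qquad (13)$$

where $a_r(x)$ denotes the value of the r-th attribute of
instance x. Then the distance between two instances $x_i$ and
$x_j$ is defined to be $d(x_i, x_j)$, where

$$d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \left(a_r(x_i) - a_r(x_j)\right)^2} \qquad (14)$$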
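
The distance of Eq. (14) is the ordinary Euclidean distance between two feature vectors. A minimal sketch, with a hypothetical function name, assuming the vectors are given as equal-length sequences of numbers:

```python
import math

def euclidean_distance(a, b):
    """Square root of the summed squared attribute differences."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
```

A smaller value means the two palmprint feature vectors are more alike, so this score is a dissimilarity measure.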



D. Score Normalization 

This step brings both matching scores into the range between
0 and 1 [11]. The normalization of the two scores is done by

$$N_{Speech} = \frac{MS_{Speech} - \min_{Speech}}{\max_{Speech} - \min_{Speech}} \qquad (15)$$

$$N_{Palm} = \frac{MS_{Palm} - \min_{Palm}}{\max_{Palm} - \min_{Palm}} \qquad (16)$$

where $\min_{Speech}$ and $\max_{Speech}$ are the minimum and
maximum scores for speech signal recognition, and
$\min_{Palm}$ and $\max_{Palm}$ are the corresponding values
obtained from the palmprint trait.

E. Generation of Similarity Scores 

Note that the normalized palmprint score, which is obtained
through the Haar Wavelet, measures the dissimilarity between
the feature vectors of two given images, while the normalized
score from the speech signal is a similarity measure. To fuse
the two scores, both must therefore be expressed as either
similarity or dissimilarity measures. In this paper, the
normalized palmprint score is converted to a similarity
measure by

$$N'_{Palm} = 1 - N_{Palm} \qquad (17)$$
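
The normalization and the conversion to a similarity score are each a one-line computation. The sketch below assumes min-max scaling, with the minimum and maximum taken over the scores observed for each trait; the function names are hypothetical.

```python
def min_max_normalize(score, lowest, highest):
    """Map a raw matching score into [0, 1] given the observed score range."""
    if highest <= lowest:
        raise ValueError("highest must be greater than lowest")
    return (score - lowest) / (highest - lowest)

def to_similarity(normalized_distance):
    """Turn a normalized dissimilarity in [0, 1] into a similarity in [0, 1]."""
    return 1.0 - normalized_distance
```

After both steps a larger value means a better match for either trait, which is what makes a weighted sum of the two scores meaningful.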



V. Fusion 



The biometric systems are integrated at the multi-modality
level to improve the performance of the verification system.
At this level, the matching scores are combined to give
a final score. The following steps are performed for fusion:

1. Given a query image and speech signal as input, features
are extracted by the individual recognizers and the
matching score of each individual trait is calculated.

2. The weights a and b are calculated using FAR and FRR.

3. Finally, the final score after combining the matching scores
of the traits is calculated by the weighted sum of scores
technique:

$$MS_{fusion} = a \cdot MS_{Palm} + b \cdot MS_{Speech} \qquad (18)$$

where a and b are the weights assigned to the two traits.
The final matching score ($MS_{fusion}$) is compared against a
certain threshold value to recognize the person as genuine
or an imposter.
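
The fusion rule of Eq. (18) and the final decision can be sketched as below. The weights a and b are taken as given inputs, because the text does not spell out how they are derived from FAR and FRR; the function names and the convention that a score at or above the threshold is accepted are assumptions.

```python
def fuse_scores(ms_palm, ms_speech, a, b):
    """Weighted sum of the two similarity scores, as in Eq. (18)."""
    return a * ms_palm + b * ms_speech

def decide(fused_score, threshold):
    """Accept as genuine when the fused similarity reaches the threshold."""
    return "genuine" if fused_score >= threshold else "imposter"
```

With weights that sum to one and both inputs in [0, 1], the fused score also stays in [0, 1], so a single threshold in that range can be tuned to trade FAR against FRR.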

VI. Experimental Results 

This section presents the experimental results of our
approach, with the Modified Canonical method for the palmprint
and Subband based Cepstral Coefficients for speech. We
evaluate the proposed multimodal system on a data set of more
than 300 subjects with 6 different samples per subject, under
two different conditions (clean and degraded data). The
training database contains palmprint images and a speech
signal for each subject.

The comparison of the two unimodal systems (palmprint and
speech modality) and the bimodal system is given in Tables I
and II. It can be seen that fusing the palmprint and speech
features improves the verification score. The experiments
show that the EER is reduced to 3.54% on the clean database
and 9.17% on the degraded database.
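
For readers who want to reproduce this kind of evaluation, the sketch below computes FAR and FRR from lists of genuine and impostor similarity scores at a given threshold. The function names and the accept-if-at-or-above-threshold convention are assumptions. The helper `mean_error` is included because the EER figures quoted above coincide with the mean of the FAR and FRR values reported after fusion: (2.76 + 4.32) / 2 = 3.54 and (6.67 + 11.67) / 2 = 9.17.

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """Return (FAR%, FRR%) for similarity scores accepted when score >= threshold."""
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    far = 100.0 * false_accepts / len(impostor_scores)
    frr = 100.0 * false_rejects / len(genuine_scores)
    return far, frr

def mean_error(far, frr):
    """Average of FAR and FRR, in percent."""
    return (far + frr) / 2.0
```

Sweeping the threshold and recording where FAR and FRR cross is the usual way to locate an equal error rate; that sweep is left out here for brevity.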

Table I. The FAR And FRR Of Palmprint And Speech Signal In Clean
And Degraded Conditions

Modality       Method  Database                          Classifier  FAR%   FRR%
Speech Signal  SBC     Clean database of 300 subjects    GMM         03.5-  10.73
Palmprint      MCF     Clean database of 300 subjects                06.73  13.24
Speech Signal  SBC     Degraded database of 50 subjects  GMM         46.67  31.67
Palmprint      MCF     Degraded database of 50 subjects              23.33  26.67

Table II. The FAR And FRR After Fusion

Palmprint method  Speech Signal method  Database                          FAR%   FRR%
MCF               SBC                   Clean database of 300 subjects    02.76  04.32
MCF               SBC                   Degraded database of 30 subjects  06.67  11.67


Conclusions

Biometric systems are widely used to overcome the limitations
of traditional methods of authentication. However, a unimodal
biometric system fails when the biometric data for its
particular trait are unavailable or unreliable. This paper
proposes a new method for selecting and dividing the ROI for
the analysis of the palmprint. The new method utilizes the
maximum palm region of a person for feature extraction. More
importantly, it can cope with slight variations, in terms of
rotation, translation and size difference, in images captured
from the same person. The experimental results show that the
palmprint-based and speech-based unimodal systems each fail to
meet the requirement. Fusion at the matching-score level is
used to improve the performance of the system. The
psychological effect of such a multimodal system should also
not be disregarded: a system using multiple modalities is
likely to seem harder to cheat to any potential impostor.

In the future we plan to test whether setting user-specific
weights for the different modalities can improve the system's
performance.